… using the MGnify API and an interactive map widget
The MGnify API returns JSON data. The jsonapi_client package can help you load this data into Python, e.g. into a Pandas dataframe.
This example shows you how to load a MGnify AtlantECO Super Study's data from the MGnify API and display it on an interactive world map.
You can find all of the other “API endpoints” using the Browsable API interface in your web browser. The URL you see in the browsable API is exactly the same as the one you can use in this code.
This is an interactive code notebook (a Jupyter Notebook). To run this code, click into each cell and press the ▶ button in the top toolbar, or press shift+enter.
Content:
- Fetch AtlantECO studies data
- Show the studies' samples on the interactive world map
- Check functional annotation terms' presence/absence:
  - GO term
  - InterPro
  - Biosynthetic Gene Clusters (BGC)
A Super Study is a collection of MGnify Studies originating from a major project. AtlantECO is one such project, aiming to develop and apply a novel, unifying framework that provides knowledge-based resources for a better understanding and management of the Atlantic Ocean and its ecosystem services.
Fetch the Super Study’s Studies from the MGnify API, into a Pandas dataframe:
```python
import pandas as pd
from jsonapi_client import Session, Modifier

atlanteco_endpoint = 'super-studies/atlanteco/flagship-studies'

with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    studies = map(lambda r: r.json, mgnify.iterate(atlanteco_endpoint))
    studies = pd.json_normalize(studies)
studies[:5]
```
| | type | id | attributes.samples-count | attributes.accession | attributes.bioproject | attributes.is-private | attributes.last-update | attributes.secondary-accession | attributes.centre-name | attributes.study-abstract | attributes.study-name | attributes.data-origination | relationships.biomes.data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | studies | MGYS00006028 | 991 | MGYS00006028 | PRJNA656268 | False | 2024-05-21T20:40:41 | SRP278138 | Biological Global Ocean Ship-based Hydrographi... | The Global Ocean Shipboard Hydrographic Invest... | Bio-GO-SHIP: Global marine 'omics studies of r... | HARVESTED | [{'id': 'root:Environmental:Aquatic:Marine', '... |
| 1 | studies | MGYS00002392 | 1073 | MGYS00002392 | PRJEB6610 | False | 2024-04-15T20:15:45 | ERP006157 | GSC | Analysis of 18S DNA in Tara Oceans Protists si... | Amplicon sequencing of Tara Oceans DNA samples... | SUBMITTED | [{'id': 'root:Environmental:Aquatic:Marine', '... |
| 2 | studies | MGYS00006613 | 58 | MGYS00006613 | PRJEB40759 | False | 2024-03-01T18:29:37 | ERP124426 | Ocean Sampling Day Consortium | Ocean Sampling Day was initiated by the EU-fun... | 18S rRNA amplicon sequencing from the Ocean Sa... | SUBMITTED | [{'id': 'root:Environmental:Aquatic:Marine', '... |
| 3 | studies | MGYS00006612 | 48 | MGYS00006612 | PRJEB40763 | False | 2024-03-01T18:14:18 | ERP124432 | Ocean Sampling Day Consortium | Ocean Sampling Day was initiated by the EU-fun... | 18S rRNA amplicon sequencing from the Ocean Sa... | SUBMITTED | [{'id': 'root:Environmental:Aquatic:Marine', '... |
| 4 | studies | MGYS00006611 | 63 | MGYS00006611 | PRJEB55999 | False | 2024-03-01T18:01:09 | ERP140920 | Ocean Sampling Day Consortium | Ocean Sampling Day was initiated by the EU-fun... | 18S rRNA amplicon sequencing from the Ocean Sa... | SUBMITTED | [{'id': 'root:Environmental:Aquatic:Marine', '... |
Show the studies’ samples on a map
We can fetch the Samples for each Study and concatenate them all into one DataFrame. Each sample has geolocation data in its attributes, which is what we need to build a map.
It takes time to fetch data for all samples, so let's show samples from the chosen study PRJEB46727 only. This study contains assembly data: https://www.ebi.ac.uk/metagenomics/studies/MGYS00005810#overview.
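Here is a minimal sketch of that fetch, assuming the study's samples expose `attributes.latitude` and `attributes.longitude`; the `sample_id`, `study`, `lon` and `lat` column names are our own choice, made to match the `show_on_map` helper defined further down:

```python
import pandas as pd
import leafmap
from jsonapi_client import Session

study_accession = 'MGYS00005810'  # the MGnify accession for PRJEB46727

with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    samples = map(lambda r: r.json, mgnify.iterate(f'studies/{study_accession}/samples'))
    samples = pd.json_normalize(samples)

# Keep only the fields we need; coordinates arrive as strings, so cast them
studies_samples = pd.DataFrame({
    'sample_id': samples['id'],
    'study': study_accession,
    'lon': samples['attributes.longitude'].astype(float),
    'lat': samples['attributes.latitude'].astype(float),
})

# A quick look at the sample locations on a world map
m = leafmap.Map(center=(0, 0), zoom=2)
m.add_points_from_xy(studies_samples, x='lon', y='lat')
m
```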
Let’s check whether a specific identifier is present in each sample.
We will work with the MGnify analyses (MGYAs) corresponding to the chosen samples. We filter analyses by:
- pipeline version: 5.0
- experiment type: assembly
This example shows how to process just the first 10 samples (again, because the full dataset takes a while to fetch). Firstly, get analyses for each sample.
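A sketch of that step, assuming the `studies_samples` dataframe built above and that the MGnify `analyses` endpoint accepts `sample_accession`, `pipeline_version` and `experiment_type` filter parameters via a jsonapi_client `Modifier`:

```python
from jsonapi_client import Modifier

analyses = []
with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    # Only the first 10 samples, to keep the fetch time reasonable
    for idx, sample in studies_samples[:10].iterrows():
        print(f"processing {sample.sample_id}")
        # Restrict to pipeline v5.0 assembly analyses of this sample
        filtering = Modifier(
            f"pipeline_version=5.0"
            f"&sample_accession={sample.sample_id}"
            f"&experiment_type=assembly"
        )
        analysis = map(lambda r: r.json, mgnify.iterate('analyses', filtering))
        analyses.append(pd.json_normalize(analysis))
analyses = pd.concat(analyses)
```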
Define two functions:
- `identify_existance` checks each analysis for the identifier's presence or absence. It adds a colour column to the dataframe: blue if the identifier was found, red if not.
- `show_on_map` plots the results on the world map. It joins the analyses and samples tables so the geolocation and identifier-presence data sit together (we create a new sub-DataFrame with a subset of the fields and add those points to the map).
```python
def identify_existance(input_analyses, identifier, term):
    data = []
    with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
        for idx, mgya in input_analyses.iterrows():
            print(f"processing {mgya.id}")
            # Fetch the chosen annotation type (e.g. GO terms) for this analysis
            analysis_identifier = map(lambda r: r.json, mgnify.iterate(f'analyses/{mgya.id}/{identifier}'))
            analysis_identifier = pd.json_normalize(analysis_identifier)
            # Blue if the term was found in this analysis, red if not
            data.append("#0000FF" if term in list(analysis_identifier.id) else "#FF0000")
    presented = sum([1 for i in data if i == "#0000FF"])
    print(f"Presented {presented} of {identifier} {term}")
    input_analyses.insert(2, identifier, data, True)
    return input_analyses


def show_on_map(input_analyses, studies_samples, identifier):
    # Join analyses with samples to bring in the geolocation columns
    df = input_analyses.join(studies_samples.set_index('sample_id'), on='relationships.sample.data.id')
    df2 = df[[identifier, 'lon', 'lat', 'study', 'attributes.accession',
              'relationships.study.data.id', 'relationships.sample.data.id',
              'relationships.assembly.data.id']].copy()
    df2 = df2.set_index("study")
    df2 = df2.rename(columns={
        "attributes.accession": "analysis_ID",
        'relationships.study.data.id': "study_ID",
        'relationships.sample.data.id': "sample_ID",
        'relationships.assembly.data.id': "assembly_ID",
    })
    m = leafmap.Map(center=(0, 0), zoom=2)
    m.add_points_from_xy(
        df2, x='lon', y='lat',
        popup=["study_ID", "sample_ID", "assembly_ID", "analysis_ID", identifier],
        color_column=identifier,
        add_legend=False,
    )
    return m
```
GO term
This example is written for the GO term for biotin transport: GO:0015878.
Other GO identifiers are available on the MGnify API.
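Putting the two helpers together for this term could look like the sketch below, where `go-terms` is the analysis sub-endpoint queried inside `identify_existance`, and `analyses` and `studies_samples` are the dataframes built earlier:

```python
# Flag each analysis by whether GO:0015878 appears in its GO-term annotations,
# then plot the blue/red points on the world map
analyses = identify_existance(analyses, 'go-terms', 'GO:0015878')
show_on_map(analyses, studies_samples, 'go-terms')
```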
MGnify has additional analysis of BGCs provided by SanntiS. These annotations are saved as RO-Crate objects and linked to assembly records.
The following example counts the proteins that are truncated from the beginning (partial value "10"), whose nearest MiBIG class is Polyketide, and whose Dice distance is at least 0.65. We will use gffutils to parse the GFF files.
Define a function to find the GFF file in a zipped archive by URL:
```python
import requests
from zipfile import ZipFile
from io import BytesIO

def find_gff_file(link):
    # Read archive and find GFF file
    response = requests.get(link)
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Open the zip file from the content of the response
        with ZipFile(BytesIO(response.content)) as zip_file:
            # List all files in the zip archive
            file_list = zip_file.namelist()
            # Filter files with .gff extension
            gff_files = [file_name for file_name in file_list if file_name.endswith(".gff")]
            print(f"Found {gff_files}")
            # Read the first .gff file (you can modify this to read a specific file)
            if gff_files:
                first_gff_file = gff_files[0]
                gff_content = zip_file.open(first_gff_file).read()
                return gff_content
            else:
                print("No .gff files found in the zip archive.")
                return None
    else:
        print("Failed to fetch the zip file.")
        return None
```
Define a function to get counts from the GFF:
```python
import tempfile
import gffutils

def get_count(nearest_MiBIG_class, nearest_MiBIG_diceDistance, partial_value, gff_content):
    if gff_content:
        try:
            # Write the GFF content to a named temporary file so gffutils can read it
            with tempfile.NamedTemporaryFile(delete=False) as temp_gff_file:
                temp_gff_file.write(gff_content)
                temp_gff_file_path = temp_gff_file.name
            # Create a GFF database using gffutils
            db = gffutils.create_db(
                temp_gff_file_path,
                dbfn=':memory:',  # Use an in-memory database
                force=True,       # Overwrite if the database already exists
                keep_order=True,  # Preserve feature order
            )
            count = 0
            for feature in db.all_features():
                if feature["nearest_MiBIG_class"][0] == nearest_MiBIG_class and \
                        float(feature["nearest_MiBIG_diceDistance"][0]) >= nearest_MiBIG_diceDistance and \
                        feature["partial"][0] == partial_value:
                    count += 1
            print(f"Count is {count}")
            return count
        except Exception:
            print('Error in GFF DB')
            return 0
    else:
        return 0
```
Process data:
```python
nearest_MiBIG_class = "Polyketide"
nearest_MiBIG_diceDistance = 0.65
partial_value = "10"

counts = []
with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    for idx, mgya in analyses.iterrows():
        # get ERZ assembly accession
        assembly = mgya['relationships.assembly.data.id']
        # Look for the SanntiS RO-Crate among the assembly's extra annotations
        annotations_for_assembly = mgnify.iterate(f'assemblies/{assembly}/extra-annotations')
        sanntis_annotation = None
        for item in annotations_for_assembly:
            if 'sanntis' in item.id:
                sanntis_annotation = item.links.self
                break
        if not sanntis_annotation:
            print('Sanntis annotation was not found')
            continue
        print(f"processing {mgya.id} {assembly}")
        gff_content = find_gff_file(sanntis_annotation)
        counts.append(get_count(nearest_MiBIG_class, nearest_MiBIG_diceDistance, partial_value, gff_content))
```
```
processing MGYA00593142 ERZ2945023
Found ['ERZ2945023_FASTA.gb.sanntis.full.gff']
Error in GFF DB
processing MGYA00589840 ERZ2945090
Found ['ERZ2945090_FASTA.gb.sanntis.full.gff']
Error in GFF DB
processing MGYA00589562 ERZ2945101
Found ['ERZ2945101_FASTA.gb.sanntis.full.gff']
Error in GFF DB
processing MGYA00589561 ERZ2944669
Found ['ERZ2944669_FASTA.gb.sanntis.full.gff']
Error in GFF DB
processing MGYA00589560 ERZ2944798
Found ['ERZ2944798_FASTA.gb.sanntis.full.gff']
Error in GFF DB
processing MGYA00589559 ERZ2944724
Found ['ERZ2944724_FASTA.gb.sanntis.full.gff']
Error in GFF DB
processing MGYA00589558 ERZ2944859
Found ['ERZ2944859_FASTA.gb.sanntis.full.gff']
Error in GFF DB
processing MGYA00589557 ERZ2944879
Found ['ERZ2944879_FASTA.gb.sanntis.full.gff']
Error in GFF DB
processing MGYA00589556 ERZ2945017
Found ['ERZ2945017_FASTA.gb.sanntis.full.gff']
Error in GFF DB
processing MGYA00589555 ERZ2944878
Found ['ERZ2944878_FASTA.gb.sanntis.full.gff']
Error in GFF DB
```
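Once the loop finishes, `counts` lines up row-for-row with `analyses`, assuming a SanntiS annotation was found for every analysis (otherwise `counts` will be shorter than the table). One option is then to attach it as a column for further inspection; the `polyketide_bgc_count` column name here is our own:

```python
# Hypothetical follow-up: store the per-analysis counts alongside the analyses
analyses.insert(2, 'polyketide_bgc_count', counts, True)
analyses[['attributes.accession', 'relationships.assembly.data.id', 'polyketide_bgc_count']].head()
```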